In the past few decades, machine learning has focused mostly on analyzing past data in order to make predictions on future data; in the next decade, Kevin thinks it will be more about conversation and interaction. Bandit algorithms close the loop between inference and data collection, balancing exploration and exploitation to optimally complete a task. The bandit problem is named after the casino slot machine (the "one-armed bandit"): the goal is to find an optimal policy for deciding which machines' arms to pull. The bandit problem can be modeled generally as follows:
\begin{algorithm}
\caption{General Bandit Problem}
\begin{algorithmic}
\Require $N$ arms of slot machines
\Procedure{GeneralBandits}{$N$}
\For{$t = 1, \ldots, T$}
\State Algorithm pulls arm $I_t \in [N] := \{1, 2, \ldots, N\}$
\State Nature reveals $r_t \sim \mathbb{P}_{I_t}$, with $\mathbb{E}\left[ r_t \mid I_t \right] = \theta_{I_t}^*$
\EndFor
\EndProcedure
\end{algorithmic}
\end{algorithm}
That is, we have $N$ arms whose rewards are drawn i.i.d. from distributions with unknown means; the algorithm must simultaneously explore (estimate the means) and exploit (pull the arm it currently believes is best). A minimal sketch of this protocol appears below.
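To make the protocol concrete, here is a minimal Python sketch, assuming Gaussian rewards around unknown means; the uniformly random policy and all numeric values are illustrative placeholders, not part of the lecture.

```python
import numpy as np

# A run of the general bandit protocol with a placeholder policy.
rng = np.random.default_rng(0)
N, T = 5, 1000
theta_star = rng.uniform(size=N)   # unknown means theta*_n (illustrative values)

rewards = []
for t in range(T):
    I_t = rng.integers(N)                   # placeholder: pull a uniformly random arm
    r_t = theta_star[I_t] + rng.normal()    # r_t ~ P_{I_t} with E[r_t | I_t] = theta*_{I_t}
    rewards.append(r_t)
```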
A central goal in bandit problems is to design an algorithm that minimizes the regret, defined as follows:
$ \textrm{Regret: } R_T = \max_i \theta_i^*T - \mathbb{E}\left [ \sum_{t=1}^Tr_t \right], $
which is the difference between the expected reward of always playing the best arm and the expected reward the algorithm actually collects. In other words, regret measures how much worse the player does than the optimal fixed-arm policy.
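Continuing the sketch above, the empirical regret of a single run can be computed directly from this definition (the expectation would be approximated by averaging over many runs):

```python
# Empirical regret of the run above: best arm's mean times T, minus the
# reward actually collected. Averaging over repeated runs approximates E[.].
R_T = T * theta_star.max() - sum(rewards)
print(R_T)
```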
UCB
UCB addresses this regret-minimization problem via the principle of optimism in the face of uncertainty: at each round, play the arm with the largest upper confidence bound on its estimated mean.
\begin{algorithm}
\caption{Upper Confidence Bound (UCB)}
\begin{algorithmic}
\Require $N$ arms of slot machines
\Procedure{UCB}{$N$}
\State \textbf{Initialize:} counts $C_n^0 \leftarrow 0$; estimated means $\hat{\theta}_n^0 \leftarrow 0$
\For{$t = 1, \ldots, T$}
\If{$t \leq N$}
\State Play arm $I_t = t$
\State $C_n^t \leftarrow \mathbb{I}\left[ n = t \right]$
\State Observe $r_t$
\State $\hat{\theta}_n^t \leftarrow r_t \cdot \mathbb{I}\left[ n = t \right]$
\Else
\State Play arm $I_t = \argmax_{n \in \left[ N \right]} \left( \hat{\theta}_n^{t-1} + U_n^{t-1} \right)$
\State $C_n^t \leftarrow C_n^{t-1} + \mathbb{I}\left[ n = I_t \right]$
\State Observe $r_t$
\State $\hat{\theta}_n^t \leftarrow \frac{r_t + C_n^{t-1} \cdot \hat{\theta}_n^{t-1}}{C_n^t} \cdot \mathbb{I}\left[ n = I_t \right] + \hat{\theta}_n^{t-1} \cdot \left( 1 - \mathbb{I}\left[ n = I_t \right] \right)$
\EndIf
\EndFor
\EndProcedure
\end{algorithmic}
\end{algorithm}
Here $U_n^t$ is a confidence width for arm $n$ after $t$ rounds; for example, UCB1 uses $U_n^t = \sqrt{2 \log t / C_n^t}$.
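Below is a minimal Python sketch of this pseudocode, again assuming Gaussian rewards and using the UCB1-style width mentioned above; other width choices trade off exploration differently.

```python
import numpy as np

def ucb(theta_star, T, rng, sigma=1.0):
    """Sketch of the UCB pseudocode above for Gaussian rewards."""
    N = len(theta_star)
    C = np.zeros(N)           # pull counts C_n^t
    theta_hat = np.zeros(N)   # estimated means
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= N:
            n = t - 1                                  # play each arm once
        else:
            U = sigma * np.sqrt(2.0 * np.log(t) / C)   # UCB1-style confidence widths
            n = int(np.argmax(theta_hat + U))          # optimism: highest upper bound
        r = theta_star[n] + sigma * rng.normal()       # observe reward
        C[n] += 1
        theta_hat[n] += (r - theta_hat[n]) / C[n]      # running-mean update
        total_reward += r
    return total_reward

# Illustrative run: compare against always playing the best arm.
rng = np.random.default_rng(0)
theta_star = np.array([0.2, 0.5, 0.9])
T = 5000
print(T * theta_star.max() - ucb(theta_star, T, rng))  # empirical regret
```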
Linear Bandits
\begin{algorithm}
\caption{Linear Bandits}
\begin{algorithmic}
\Require action set $\mathcal{X} \subseteq \mathbb{R}^d$
\Procedure{LinearBandits}{$\mathcal{X}$}
\For{$t = 1, \ldots, T$}
\State Algorithm picks $x_t \in \mathcal{X}$
\State Nature reveals $y_t = \langle x_t, \theta^* \rangle + \xi_t$
\EndFor
\EndProcedure
\end{algorithmic}
\end{algorithm}
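Here is a minimal Python sketch of this protocol, assuming a finite action set $\mathcal{X} \subset \mathbb{R}^d$ and Gaussian noise $\xi_t$; the random action choice stands in for a real algorithm (e.g., LinUCB), and the regularized least-squares estimate illustrates how $\theta^*$ can be recovered from the observations.

```python
import numpy as np

# Linear bandit protocol sketch with a finite action set X in R^d.
rng = np.random.default_rng(0)
d, K, T = 3, 10, 1000
X = rng.normal(size=(K, d))        # action set (assumed finite here)
theta_star = rng.normal(size=d)    # unknown parameter

A = np.eye(d)                      # ridge-regularized Gram matrix
b = np.zeros(d)
for t in range(T):
    x_t = X[rng.integers(K)]                  # placeholder: pick a random action
    y_t = x_t @ theta_star + rng.normal()     # y_t = <x_t, theta*> + xi_t
    A += np.outer(x_t, x_t)                   # accumulate sufficient statistics
    b += y_t * x_t

theta_hat = np.linalg.solve(A, b)  # least-squares estimate of theta*
```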
Contextual Bandits